Executive Summary

CitiBike program have tasked us with the responsibility of analyzing data that they have collected about rides in 2019. In order to identify patterns in the ride history data, we investigated the relationship between all given factors. We focused on the demographics of the riders as well as the how and when these riders tend to use the bikes. We believe these aspects could contribute to an increase in ridership and profit.

Our report provides findings that showcase how riders tend to be 39 years or younger, the majority of riders are subscribers, inclement weather negatively affects ride time, the popular commute times are during the morning and late afternoon/evenings and certain stations have a surplus or deficit of bikes. The report also includes recommendations for the CitiBike program based on findings in terms of marketing efforts, pricing strategies and managing surpluses and deficits.

Exploratory Visualizations

Gender

citi %>% ggplot(aes(x=gender,fill=gender)) + geom_bar(alpha=.8) + theme_fivethirtyeight() + scale_fill_brewer(palette="Set2")+theme(legend.position="none")+ggtitle(expression(atop("Gender of Citi Bikers"))) 

When looking at the distribution of riders based on gender, it is clear that the majority of riders in the 2019 data set were males (68%). Females and unidentified genders only make up a combined 32% of riders in 2019.

Age

citi %>% ggplot(aes(x=age_category,fill=age_category)) + geom_bar(alpha=.8)+theme_fivethirtyeight()+scale_fill_brewer(palette="Set2")+theme(legend.position="false")+ggtitle(expression(atop("Distribution by Age Group")))

In terms of the age distribution, it is clear that the majority of riders tend to come from younger age groups. Around 56.3% of riders in 2019 were 39 years and under, meaning only 43.7% of riders were 40 and over.

ggplot(data = citi, aes(x = age, y = distMiles)) + geom_point() + geom_smooth(method = 'lm') + ggtitle("Comparison of Age to Distance traveled") + theme_fivethirtyeight()
## `geom_smooth()` using formula 'y ~ x'

Ride History Patterns

Usertype

Commute Times

ggplot(citi) + geom_bar(aes(time, fill=gender)) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ggtitle(expression(atop("Commute Times by Gender"))) + scale_fill_manual(values = c("lightgreen", "lightblue","pink")) + scale_fill_brewer(palette="Set2") + theme_fivethirtyeight() #+transition_states(states = time)+enter_fade() + shadow_mark()  

The most popular times for rides seem to occur during the morning commute time of 7am-10am as well as the late afternoon / evening commute of 4pm to 7pm. Using information from the previous graph that showcases how majority of riders are subscribers that utilize the bikes more heavily during the weekdays, we can see that these time gaps correlate to working hours. Thus, subscribers usually take the bikes to get to work in the morning and then to get back home in the late afternoons/evenings. This information could be useful for CitiBike to use to make sure bikes are available for use during the most frequented times of day.

Average Speed

ggplot(avg_speed_age) + geom_line(aes(as.numeric(age), mean_speed, color = gender)) + labs(title = "Average Speed based on Age and Gender in 2019", x = "Age", y = "Average Speed (Miles Per Hour)") + theme(plot.title = element_text(hjust = 0.7)) + theme_fivethirtyeight() + scale_color_discrete(name = "Gender", labels = c("Female", "Male")) #+ transition_reveal(age)

Rides by Weekday/Hour

ggplot(citi) + geom_bar(aes(x=weekday, y=(..count..)/sum(..count..), fill=usertype)) + ggtitle("Rides on days of week, by usertype") + theme_fivethirtyeight() + xlab("Weekday") + ylab("Percentage of All Rides")

ggplot(citi) + geom_bar(aes(x = hour, y=(..count..)/sum(..count..), fill = usertype)) + ylab("% of rides") + ggtitle("Rides per hour of the day, by usertype") 

Traffic

Top Ten Start/End Stations Barplots

plot1 <- ggplot(top_start_count, aes(reorder(Var1,-Freq), Freq)) + geom_bar(stat="identity", fill = "lightblue") + geom_text(aes(label=Freq), vjust=-0.3, size=3.5) + 
theme_minimal() + theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5)) + ggtitle("Top 10 Start Stations") + xlab("Station Name") + ylab("Total Trips Started Out of station") 

plot2 <- ggplot(top_end_count, aes(reorder(Var1,-Freq), Freq)) + geom_bar(stat="identity", fill = "lightblue") + geom_text(aes(label=Freq), vjust=-0.3, size=3.5) + 
theme_minimal() + theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5)) + ggtitle("Top 10 End Stations") + xlab("Station Name") + ylab("Total Trips Ended at Station")
grid.arrange(plot1, plot2, ncol= 2)

Start/End Stations Maps

#start stations
leaflet(citi) %>%
  addProviderTiles(providers$providers$CartoDB.DarkMatter) %>% 
  setView(lng = -73.98928, lat = 40.75042, zoom = 10) %>%
addMarkers(lng = citi$start.station.longitude, lat = citi$start.station.latitude,
     popup = "Starting")
# #end stations 
leaflet(citi) %>%
addProviderTiles(providers$providers$CartoDB.DarkMatter) %>% 
  setView(lng = -73.98928, lat = 40.75042, zoom = 10) %>%
addMarkers(lng = citi$end.station.longitude, lat = citi$end.station.latitude,
     popup = "Ending")

Top Routes

Deficit/Surplus

#map by deficit (all stations)
leaflet(bike_deficit) %>% 
  addTiles() %>%
  setView(-74, 40.75, zoom = 11.5) %>%
  addCircleMarkers(lng = bike_deficit$longitude, lat = bike_deficit$latitude, 
                   popup = paste(bike_deficit$station, "<br>", ifelse(bike_deficit$deficit>=0, "Bike deficit = ", "Bike surplus = "), 
                                abs(bike_deficit$deficit)), 
                   radius = abs(bike_deficit$deficit)/5, color = ifelse(bike_deficit$deficit>0, "red", "green"))
#top 5 surplus 
bike_surplus_5 <-arrange(bike_deficit, (deficit))[1:5,]
popup_1 <- paste0("<b>", bike_surplus_5$station, "</b><br>",
                "Deficit/Surplus: ", bike_surplus_5$deficit, "<br>",
                "Arrival Count: ", bike_surplus_5$count_arrival, "<br>",
                "Departure Count: ", bike_surplus_5$count_dep, "<br>")

leaflet() %>% 
  addTiles() %>%
  addMarkers(lng = bike_surplus_5$longitude, lat = bike_surplus_5$latitude, popup = popup_1) 
gt(bike_surplus_5) %>% 
 cols_label(
    station = "Station",
    latitude = "Latitude",
    longitude = "Longitude",
    count_dep = "Departures",
    count_arrival = "Arrivals", 
    deficit = "Surplus")
Station Latitude Longitude Departures Arrivals Surplus
Pershing Square North 40.7519 -73.9777 1214 1347 -133
Broadway & E 22 St 40.7403 -73.9895 906 1009 -103
West St & Chambers St 40.7176 -74.0132 874 975 -101
Old Fulton St 40.7028 -73.9938 262 348 -86
E 2 St & 2 Ave 40.7250 -73.9907 413 490 -77
#top 5 deficit
bike_deficit_5<-arrange(bike_deficit, -deficit)[1:5,]
popup_2 <- paste0("<b>", bike_deficit_5$station, "</b><br>",
                "Deficit/Surplus: ", bike_deficit_5$surplus, "<br>",
                "Arrival Count: ", bike_surplus_5$count_arrival, "<br>",
                "Departure Count: ", bike_surplus_5$count_dep, "<br>")
leaflet() %>% 
  addTiles() %>%
  addMarkers(lng = bike_deficit_5$longitude, lat = bike_deficit_5$latitude, popup = popup_2) 
gt(bike_deficit_5) %>% 
 cols_label(
    station = "Station",
    latitude = "Latitude",
    longitude = "Longitude",
    count_dep = "Departures",
    count_arrival = "Arrivals", 
    deficit = "Deficit")
Station Latitude Longitude Departures Arrivals Deficit
W 59 St & 10 Ave 40.7705 -73.9880 370 285 85
Columbus Ave & W 72 St 40.7771 -73.9790 510 432 78
E 32 St & Park Ave 40.7457 -73.9820 592 525 67
1 Ave & E 30 St 40.7414 -73.9754 392 326 66
Carmine St & 6 Ave 40.7304 -74.0022 649 586 63

Start Stations by Usertype

leaflet(citistations) %>% 
      addTiles() %>% 
      addCircleMarkers(lat = citistations$lat, lng = citistations$long, popup = citistations$name, radius = citistations$count/100, color = ifelse(citistations$usertype == "Subscriber", "blue", "red"))

Start Stations by Gender

leaflet(citistations_gender) %>% 
      addTiles() %>% 
      addCircleMarkers(lat = citistations_gender$lat, lng = citistations_gender$long, popup = citistations_gender$name, radius = citistations_gender$count/100, color = ifelse(citistations_gender$gender == "male", "blue", "red"))

Bike Usage

ggplot(citi) + geom_histogram(aes(x = bikeid), stat= "count") + ggtitle("Count of rides per bikeid")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

citi %>% group_by(bikeid) %>% summarise(total = sum(tripduration)) %>% ggplot(aes(reorder(bikeid, total), total)) + geom_col() + scale_y_log10(labels = comma) + ggtitle("Total Duration of Rides per Bike") + theme_fivethirtyeight() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) + theme(axis.title = element_text()) + ylab("Duration (seconds)")

citi %>% group_by(bikeid) %>% summarise(mean = mean(speedMilesperHour)) %>% filter(mean > 0) %>% ggplot(aes(reorder(bikeid, mean), mean)) + geom_col() + ggtitle("Average Speed per Bike") + theme_fivethirtyeight() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) + scale_y_continuous(labels = comma) + theme(axis.title = element_text()) + ylab("Miles per Hour")

Weather Impact

ggplot(citi) + geom_histogram(aes(x = TAVG, fill = usertype)) + xlim(0,100) + ggtitle("Ride by Average Temperatures, by usertype") + theme_fivethirtyeight() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing missing values (geom_bar).

ggplot(data = citi, aes(y = SNOW, x = tripduration/60)) + geom_point(alpha = .1, color = "red") + labs(x = "duration") +labs(y = "snow accumulation") + scale_x_log10() + ggtitle("Duration of rides when snowing")  + theme_fivethirtyeight() + theme(axis.title = element_text()) + ylab("Amount of snow falling") + xlab("Duration of ride (minutes)")

 ggplot(data = citi, aes(y = PRCP, x = tripduration/60)) + geom_point(alpha = .1, color = "blue") + labs(x = "duration") +labs(y = "Precipitation accumulation") + scale_x_log10() + ggtitle("Duration of rides when raining")  + theme_fivethirtyeight() + theme(axis.title = element_text()) + ylab("Amount of rain falling") + xlab("Duration of ride (minutes)")

  ggplot(data = citi, aes(y = AWND, x = tripduration/60)) + geom_point(alpha = .1, color = "darkgreen") + labs(x = "duration") +labs(y = "Precipitation accumulation") + scale_x_log10() + ggtitle("Duration of rides by wind speed")  + theme_fivethirtyeight() + theme(axis.title = element_text()) + ylab("Wind speed") + xlab("Duration of ride (minutes)")

Recommendations

Based on our findings, we recommend that CitiBike take the following actions:

  • In terms of their marketing efforts, Citibike should consider implementing demographic specific initiatives. They should focus their attention on increasing marketing efforts that target women and elderly individuals. They could also help attract more female user by emphasizing safety initiatives to remove deterrents. In addition, they can offer senior discounts to target the older demographic.

  • Target certain customer types based on day of week : Improve low ridership from customers on weekdays, and from subscribers on weekends by offering discounts on bikes (for customers) and ebikes (subscribers)

  • Vary pricing strategies based on inclement weather : Lower prices/offer discounts on rainy or snowy days & days when temp is below 50 degrees

  • Manage surpluses and deficits : Incentivize riders to drop off/pick up bikes from certain stations that tend to have extra/not enough bikes available